Objectives

Data Science is a field of science: we try to extract useful information from data. In order to use the data efficiently and correctly, we must understand the data first. Guided by the goal of the study and the relevant domain knowledge, we then design the study. In this lecture we first go through some basic exploratory data analysis to understand the nature of the data and to identify plausible relationships among the variables.

Data mining tools have expanded dramatically in the past 20 years. The linear model, a building block of data science, is simple and powerful. We introduce/review linear models. The focus is to understand what it is we are modeling, how to use the data to extract that information, and how to quantify the intrinsic variability of our statistics.

Suggested reading:

  • This lecture, in detail
  • Chapter 3 of ISLR
  • Stat Sleuth: Ch 7-10 (Multiple Regression), Ch 5 (One-Way ANOVA)
  • Note: Many important topics are in Appendices. Please go through them.
  • Data needed: car_04_regular.csv and Cars_04.csv
  1. Elements of multiple regression

    • Model specification
    • General linear models
    • LS estimates and their properties
    • Inference: for the coefficients, for the mean response, and for a future response
  2. Categorical predictors

    • Adding categorical predictors to the multiple regression (ANOVA)
    • Model with both continuous and categorical variables (ANCOVA)
    • Models with or without interactions
  3. More about multiple regression

    • Sandwich standard error estimates to handle heteroscedasticity vs. homoscedasticity
    • Modeling non-linear relationships
    • Transformation of variables to meet linear model assumptions
    • Outliers, Leverage points
  4. Summary

  5. Appendices

  6. R functions

    • lm()
    • anova()
    • Anova()
    • sandwich, lmtest::coeftest: sandwich cov estimators
    • stargazer (organize table, lm output)

1 Case Study: Fuel Efficiency in Automobiles

When an automobile manufacturer designs a new car, what is the benchmark to compare against in terms of fuel efficiency? How well can we predict a house's price based on its characteristics? Multiple regression is an extension of simple regression: it explores the relationship between one variable (the response) and many other variables (predictors or features). We highlight regression methods in this lecture. It is important to understand that the effect of each predictor on the response depends on what else is included in the model. Our approach here unifies the usual multiple regression and ANOVA.

Goal of the study: How to build a fuel efficient car?

  • Effects of features on a car
  • Given a set of features, we’d like to estimate the mean fuel efficiency of all such cars, as well as predict the efficiency of one particular car
  • Are Asian cars more efficient than cars built in other regions?

We will use the car_04_regular.csv data set to perform a case study on fuel efficiency of automobiles. Below is a quick description of all the features in our dataset.

Feature Description
Continent Continent the Car Company is From
Horsepower Horsepower of the Vehicle
Weight Weight of the vehicle (thousand lb)
Length Length of the Vehicle (inches)
Width Width of the Vehicle (inches)
Seating Number of Seats in the Vehicle
Cylinders Number of Engine Cylinders
Displacement Volume displaced by the cylinders, defined as \(\pi/4 \times \text{bore}^{2} \times \text{stroke} \times \text{number of cylinders}\)
Transmission Type of Transmission (manual, automatic, continuous)

Once we know the data structure, we can rephrase the goal of the study. Specifically, fuel efficiency is measured by Miles per Gallon in the city: \(Y\) = MPG_City

  1. Predictors: the effect of each feature on \(Y\)

  2. Estimation and prediction: estimate the mean MPG_City for all cars with the features specified below, and predict \(Y\) for the one particular car described below

  3. Are cars built by Asian manufacturers more efficient?

  4. In particular, we would like to investigate MPG_City for this newly designed American car:

Feature Value
Continent America
Horsepower 225
Weight 4
Length 180
Width 80
Seating 5
Cylinders 4
Displacement 3.5
Transmission automatic

1.1 Exploratory Data Analysis

It is always crucial to look at the data first. To keep the lecture short, we put this part in Appendix 1, which explains how we cleaned the original data Cars_04.csv and created a new version called car_04_regular.csv.

To summarize the EDA quickly: we excluded some high-performance cars with large Horsepower (HP > 390) as well as very low-power ones (HP ≤ 100). We also took a subset of features, keeping only those listed in the description table.

1.2 A quick glimpse

data1 <- read.csv("data/car_04_regular.csv", header=TRUE)

names(data1)
##  [1] "Make.Model"   "Continent"    "MPG_City"     "MPG_Hwy"     
##  [5] "Horsepower"   "Weight"       "Length"       "Width"       
##  [9] "Seating"      "Cylinders"    "Displacement" "Make"        
## [13] "Transmission"
dim(data1)   # 226 cars and 13 variables
## [1] 226  13
str(data1)
## 'data.frame':    226 obs. of  13 variables:
##  $ Make.Model  : Factor w/ 226 levels "Acura_MDX","Acura_NSX",..: 3 5 6 4 2 1 7 8 9 10 ...
##  $ Continent   : Factor w/ 3 levels "Am","As","E": 2 2 2 2 2 2 3 3 3 3 ...
##  $ MPG_City    : int  18 20 23 25 17 17 20 18 17 16 ...
##  $ MPG_Hwy     : int  24 28 32 34 24 23 28 25 24 22 ...
##  $ Horsepower  : int  225 270 200 160 252 265 170 220 330 250 ...
##  $ Weight      : num  3.9 3.58 3.32 2.77 3.2 ...
##  $ Length      : num  197 189 183 172 174 ...
##  $ Width       : num  71.6 72.2 69.4 67.9 71.3 77 76.3 76.1 74.6 76.1 ...
##  $ Seating     : int  5 5 5 4 2 7 5 5 5 5 ...
##  $ Cylinders   : int  6 6 4 4 6 6 4 6 6 6 ...
##  $ Displacement: num  3.5 3.2 2.4 2 3 3.5 1.8 3 4.2 2.7 ...
##  $ Make        : Factor w/ 38 levels "Acura","Audi",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ Transmission: Factor w/ 3 levels "automatic","cont_variable",..: 1 1 1 1 1 1 1 1 1 1 ...
head(data1) #View(data1)
##   Make.Model Continent MPG_City MPG_Hwy Horsepower Weight Length Width
## 1   Acura_RL        As       18      24        225   3.90    197  71.6
## 2   Acura_TL        As       20      28        270   3.58    189  72.2
## 3  Acura_TSX        As       23      32        200   3.32    183  69.4
## 4  Acura_RSX        As       25      34        160   2.77    172  67.9
## 5  Acura_NSX        As       17      24        252   3.20    174  71.3
## 6  Acura_MDX        As       17      23        265   4.45    189  77.0
##   Seating Cylinders Displacement  Make Transmission
## 1       5         6          3.5 Acura    automatic
## 2       5         6          3.2 Acura    automatic
## 3       5         4          2.4 Acura    automatic
## 4       4         4          2.0 Acura    automatic
## 5       2         6          3.0 Acura    automatic
## 6       7         6          3.5 Acura    automatic

We create a new car for prediction later as follows.

newcar <-  data1[1, ]  # Create a new row with same structure as in data1
newcar[1] <- "NA" # Assign features for the new car
newcar[2] <- "Am"
newcar["MPG_City"] <- "NA"
newcar["MPG_Hwy"] <- "NA"
newcar[5:11] <- c(225, 4, 180, 80, 5, 4, 3.5)
newcar$Make <-  "NA"
newcar[13] <- "automatic"
newcar
##   Make.Model Continent MPG_City MPG_Hwy Horsepower Weight Length Width
## 1         NA        Am       NA      NA        225      4    180    80
##   Seating Cylinders Displacement Make Transmission
## 1       5         4          3.5   NA    automatic

2 Introduction to Multiple regression

2.1 Model Specification

How does Length affect MPG_City?

It depends on how we model the response. We will investigate three models with Length.

For ease of presentation, we define the predictors below, which we will use in the subsequent models:

\[x_1 = Length, \quad x_2 = Horsepower, \quad x_3 = Width, \quad x_4 = Seating, \quad x_5 = Cylinders, \quad x_6 = Displacement\]

We will create three models:

M1. Our first model will only contain one predictor, Length:

\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \epsilon\]

The interpretation of \(\beta_1\): in general, the mean of \(y\) changes by \(\beta_1\) if a car is 1" longer. Nothing else is held fixed here, so this \(\beta_1\) mixes the effect of Length with that of any feature correlated with Length; we can’t really peel off the pure effect of Length on \(y\).

M2. Next, we add the predictor Horsepower to our model

\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \epsilon\]

The interpretation of \(\beta_1\): in general, the mean of \(y\) changes by \(\beta_1\) if a car is 1" longer and the Horsepower is held the same.

M3. Finally, we fit a model with multiple predictors

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{i4} + \beta_5x_{i5} + \beta_6x_{i6} + \epsilon\]

The interpretation of \(\beta_1\): in general, the mean of \(y\) changes by \(\beta_1\) if a car is 1" longer and the rest of the features are held the same.

Question: Are all the \(\beta_1\)’s the same in the 3 models above?

No. The effect of Length \(\beta_1\) depends on the rest of the features in the model!!!!

2.2 General multiple linear models

In general, we define a multiple regression as

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i\]

  • Linearity Assumption for this model is

\[\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} \]

  • The homoscedasticity assumption is

\[\textbf{Var}(y_i | x_{i1}, x_{i2}, \dots, x_{ip}) = \sigma^2\]

  • The error terms are iid and defined as

\[\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\] This is the same as saying the errors are independent, each with a normal distribution of mean \(0\) and an equal variance \(\sigma^2\).

  • Parameters of interest are all the \(\beta\)'s and \(\sigma^2\).

  • Simple interpretation of each \(\beta_i\)

  • Simple models for predictions
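A small simulation from this model makes the specification concrete. This is a sketch with \(p = 2\) hypothetical predictors, not part of the case study:

set.seed(2)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)            # two continuous predictors
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(n, 0, 1)  # beta0=1, beta1=2, beta2=-0.5, sigma=1
coef(lm(y ~ x1 + x2))                     # estimates should be close to (1, 2, -0.5)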

2.3 OLS and its Properties

These \(\beta\) parameters are estimated using the same approach as in simple regression, namely by minimizing the residual sum of squares (\(RSS\)):

\[\min_{b_0,\,b_1,\,b_{2},\dots,b_{p}} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_p x_{ip})^{2}\]

To be specific:

Let \(X\) denote the \(n \times (p+1)\) design matrix whose \(i\)th row is \((1, x_{i1}, x_{i2}, \dots, x_{ip})\). That is:

\[X= \begin{pmatrix} 1& x_{11} & x_{12} & \dots & x_{1p} \\ 1& x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1& x_{n1} & x_{n2} & \dots & x_{np} \end{pmatrix}\]

2.3.1 OLS Estimates

Mathematical formula for the OLS estimates:

The least squares estimator has the closed form \[\hat\beta = (X^{T}X)^{-1} X^{T}Y \]

Variance of the OLS:

The variance-covariance matrix of \(\hat \beta\)

\[\textbf{V}[\hat\beta | X] = \sigma^{2} (X^{T}X)^{-1}\] Remark:

The above covariance matrix is obtained under the homoscedasticity assumption on the \(y\)'s.

Normality of the OLS:

Under all the linear model assumptions, the LSE’s are unbiased estimators of \(\beta\) and they are normally distributed: \[ \hat\beta \mid X \sim \mathcal{N}\left(\beta,\; \sigma^{2} (X^{T}X)^{-1}\right)\]

Standard errors

Once we know how to estimate the common \(\sigma^2\), we can estimate \(\textbf{V}[\hat\beta | X]\) by \(\hat {\textbf{V}}[\hat\beta | X] = \hat \sigma^{2} (X^{T}X)^{-1}\). The standard error of each \(\hat \beta_i\) is the square root of the \(i\)th diagonal entry: \[\text {se}\, (\hat \beta_i) = \sqrt{\left[ \hat \sigma^{2} (X^{T}X)^{-1} \right]_{ii}}\]

Confidence intervals for each \(\beta_i\):

  • A \(95\%\) t-interval for \(\beta_i\) is \[\hat \beta_i \pm t^*_{.025} \text{se}(\hat \beta_i).\]

  • Here \(t^*_{.025}\) is the upper 2.5% critical value of a \(t\) distribution with \(n-p-1\) degrees of freedom. You may take it to be approximately \(1.96\) if we have a large number of observations \(n\).

  • \(\text{se}(\hat \beta_i)\) is taken from the variance formula above.

  • It is a t-confidence interval because we need to estimate the unknown \(\sigma^2\).

Hypothesis test for each \(\beta_i\):

To test \[H_0: \beta_i = 0 \;\;\; \text{vs.}\;\;\; H_1: \beta_i \neq 0,\]

which means that, given the other variables in the model, there is no \(x_i\) effect, we carry out a t-test:

\[ \text{tstat} = \frac{\hat \beta_i - 0}{\text{se} (\hat \beta_i)}\] The p-value is \[\text{p-value} = 2 \times P(T > |\text{tstat}|),\] where \(T\) follows a \(t_{n-p-1}\) distribution.

We reject the null hypothesis at an \(\alpha\) level if the p-value is < \(\alpha\).
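All of these quantities can be reproduced by hand in R. A minimal sketch (using the built-in mtcars data, since we only fit the car models later; the model below is purely for illustration):

fit   <- lm(mpg ~ wt + hp, data = mtcars)
se    <- sqrt(diag(vcov(fit)))        # se(beta_i): sqrt of i-th diagonal of sigma^2 (X'X)^{-1}
tstat <- coef(fit) / se               # t statistics for H0: beta_i = 0
df    <- fit$df.residual              # n - p - 1
pval  <- 2 * pt(abs(tstat), df, lower.tail = FALSE)
cbind(estimate = coef(fit), se = se, tstat = tstat, pvalue = pval,
      lwr = coef(fit) - qt(.975, df) * se,    # 95% t-intervals
      upr = coef(fit) + qt(.975, df) * se)
# compare with summary(fit)$coefficients and confint(fit)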

A \(95\%\) Confidence interval for the mean given a set of predictors:

\[\hat y \pm \,\,t^*_{.025}\,\, \times se(\hat y) \]

A \(95\%\) prediction interval for a future \(y\) given a set of predictors: \[\hat y \pm \,\,t^*_{.025}\,\, \times \sqrt{\hat \sigma^2 + \text{se}(\hat y)^2} \;\approx\; \hat y \pm \,\,t^*_{.025}\,\, \times \hat \sigma,\] since \(\text{se}(\hat y)\) is typically small relative to \(\hat \sigma\).

2.3.2 Residual Sum of Squares (RSS)

For multiple regression, \(RSS\) is defined as:

\[RSS = \sum_{i=1}^{n} \hat{\epsilon}_i^{2} = \sum_{i=1}^{n} (y_i-\hat y_i)^2= \sum_{i=1}^{n} (y_i - (\hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \dots + \hat\beta_p x_{ip}))^{2}\]

2.3.3 Mean Squared Error (MSE), Residual Standard Error (RSE)

\(\sigma^2\) is estimated by the Mean Squared Error (MSE). For multiple regression, MSE is defined as:

\[MSE = \frac{RSS}{n-p-1} \\ \text{where }p =\text{number of predictors}\]

\(\sigma\) is estimated by the Residual Standard Error (RSE). For multiple regression, RSE is defined as:

\[\hat \sigma= RSE = \sqrt{MSE} = \sqrt{\frac{RSS}{n-p-1}}\] Remark:

Notice the denominator in \(MSE\) is \(n-(p+1)\), where \(p+1\) is the number of coefficients in the model (\(p\) predictors plus the intercept). By doing so we gain a nice mathematical property: \(MSE\) is an unbiased estimator of \(\sigma^2\). In other words, in the long run, the estimator on average hits the quantity being estimated.
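A quick numerical check of these formulas, continuing the mtcars sketch from above (again just an illustration):

n   <- nrow(mtcars); p <- 2           # p = 2 predictors (wt and hp)
RSS <- sum(residuals(fit)^2)
MSE <- RSS / (n - p - 1)
c(RSE = sqrt(MSE), sigma.hat = summary(fit)$sigma)   # the two agree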

2.3.4 Goodness of Fit: \(R^2\)

Total Sum of Squares (TSS):

TSS measures the total variability in the response \(Y\): \(TSS=0\) only in the degenerate case where all the responses are identical, so that nothing is left for the predictors to explain. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression.

\[TSS = \sum_{i=1}^{n} (y_i - \bar{y})^{2}\]

\(R^2\):

How much variability is captured in the linear model using this set of predictors? \(R^{2}\) measures the proportion of variability in \(Y\) that can be explained using this set of predictors in the model.

\[R^{2} = \frac{TSS - RSS}{TSS}\] Remark 1:

Since \(TSS = RSS + \sum_i (\hat y_i - \bar y)^2\) (the explained sum of squares), we have \(TSS \geq RSS\), and hence \(0 \le R^2 \leq 1\). One can also show that \(R^{2} = \left(\text{corr}(y, \hat y)\right)^2\).

An \(R^{2}\) statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
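These identities are easy to verify numerically (same mtcars sketch as above):

TSS <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ESS <- sum((fitted(fit) - mean(mtcars$mpg))^2)
c(TSS = TSS, RSS.plus.ESS = RSS + ESS)        # TSS = RSS + ESS
c(R2 = 1 - RSS/TSS,
  from.summary = summary(fit)$r.squared,
  corr.sq = cor(mtcars$mpg, fitted(fit))^2)   # all three agree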

Remark 2:

How large does \(R^2\) need to be for you to be comfortable using the linear model? Though \(R^2\) is a very popular notion of goodness of fit, it has its limitations. All of the sums of squared errors defined so far are training errors: they only measure how well a model fits the data used to build it. The model may not generalize well to unseen data.

2.4 R function lm()

Linear models are popular due to their nice interpretations as well as their clean solutions. The R function lm() will be our main tool. It takes a model specification together with other options, and outputs all the estimates and summary statistics: various sums of squares, standard errors of the estimates, test statistics and p-values. Finally, predicted values with margins of error (confidence intervals and prediction intervals) can be extracted with predict().

2.5 Compare three models

Linear model effects, or coefficients, are defined within the model: they depend on the rest of the variables included. Let us compare the three model fits, focusing on the interpretation of the coefficient of Length in each model.

2.5.1 Model 1: MPG_City ~ Length

We first do a simple regression. Let the response \(y_i\) be the MPG_City and the explanatory variable \(x_{i1}\) be Length (\(i = 1, \dots, n=226\)).

\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \epsilon\]

Model assumptions:

  1. Linearity: given Length, the mean of MPG_City is \(\textbf{E}(y_i | x_i) = \beta_0 + \beta_1 x_i\)
  2. Equal variance: the variance of MPG_City is the same for any Length, i.e. \(\textbf{Var}(y_i | x_{i1}) = \sigma_{1}^2\)
  3. Normality: the errors are independent and normally distributed, i.e. \(\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma_{1}^2)\)

We now create a model with lm()

fit1 <- lm(MPG_City ~ Length, data = data1)    # model specification response ~ x1,.. 
ggplot(data1, aes(x = Length , y = MPG_City)) + 
  geom_point() +
  geom_smooth(method="lm",se=F) + 
geom_hline(aes(yintercept = mean(MPG_City)), color = "red") 
## `geom_smooth()` using formula 'y ~ x'

Note from the summary below that \(\hat\beta_1\) for Length is estimated as \(-0.1398\). We say that, on average, MPG_City drops by \(0.1398\) if a car is \(1\)'' longer.

fit1 <- lm(MPG_City ~ Length, data = data1)  # model one 
summary(fit1) 
## 
## Call:
## lm(formula = MPG_City ~ Length, data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.626  -2.279  -0.151   1.977  10.731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.3138     2.8975   15.64   <2e-16 ***
## Length       -0.1398     0.0155   -9.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.18 on 224 degrees of freedom
## Multiple R-squared:  0.266,  Adjusted R-squared:  0.263 
## F-statistic: 81.2 on 1 and 224 DF,  p-value: <2e-16
# predict(fit1, newcar, interval ="prediction",  se.fit = TRUE)  
# TSS <- sum((data1$MPG_City - mean(data1$MPG_City))^2)
# RSS <- sum((fit1$res)^2); RSS
# model diagnosese
# plot(fit1$fit, fit1$res); same as plot(fit1, 1) # linearity and homoscedasticity
# qqnorm(fit1$res); qqline(fit1$res) # looking for a well fitted straight line
# plot(fit1, 2)

2.5.2 Model 2: MPG_City ~ Length + Horsepower

Let the response \(y_i\) be the MPG_City and the explanatory variables be Length and Horsepower (\(i = 1, \dots, n=226\)).

\[y_i = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2} + \epsilon\]

fit2 <- lm(MPG_City ~ Length + Horsepower, data = data1) 
stargazer::stargazer(fit2, type=output_format, align=TRUE)
================================================
                       Dependent variable:
                   ----------------------------
                            MPG_City
------------------------------------------------
Length                      -0.062***
                             (0.013)
Horsepower                  -0.037***
                             (0.003)
Constant                    38.600***
                             (2.230)
------------------------------------------------
Observations                   226
R2                            0.591
Adjusted R2                   0.587
Residual Std. Error      2.380 (df = 223)
F Statistic          161.000*** (df = 2; 223)
================================================
Note:               *p<0.1; **p<0.05; ***p<0.01

A couple of quick notes:

  1. The order of the predictors plays no role.
  2. The output of lm() has the same form regardless of how many predictors are used.
  3. The model assumptions extend naturally to multiple predictors.
summary(fit2)  #sum((fit2$res)^2)
## 
## Call:
## lm(formula = MPG_City ~ Length + Horsepower, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.152 -1.558  0.154  1.492  8.563 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.62587    2.22525   17.36  < 2e-16 ***
## Length      -0.06191    0.01300   -4.76  3.5e-06 ***
## Horsepower  -0.03690    0.00277  -13.31  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.38 on 223 degrees of freedom
## Multiple R-squared:  0.591,  Adjusted R-squared:  0.587 
## F-statistic:  161 on 2 and 223 DF,  p-value: <2e-16

NOTICE: Comparing fit2 (Length + Horsepower) to fit1 (Length):

  1. The coefficient for Length changed to \(-0.0619\) from \(-0.1398\)!!!!
  2. \(R^{2}\) increased to \(0.591\) from \(0.266\) - a huge increase. (\(R^2\) never decreases when a predictor is added, why?)
  3. \(RSE\) (residual standard error) decreased to \(2.372\) from \(3.175\). (It almost always decreases when a predictor is added, though unlike \(R^2\) this is not guaranteed - why?)

The effects of the variables (the \(\beta\)'s) are defined within the model. They all depend on what else is included or accounted for!

2.5.3 Model 3: Several continuous variables

We now add multiple variables to our model:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2x_{i2} + \beta_3x_{i3} + \beta_4x_{i4} + \beta_5x_{i5} + \beta_6x_{i6} + \epsilon\]

fit3 <- lm(MPG_City ~ Length + Horsepower + Width + Seating +
           Cylinders + Displacement, data = data1)
summary(fit3)  
## 
## Call:
## lm(formula = MPG_City ~ Length + Horsepower + Width + Seating + 
##     Cylinders + Displacement, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.877 -1.462  0.073  1.149  8.261 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.63372    4.02793   11.33  < 2e-16 ***
## Length        0.04909    0.01730    2.84    0.005 ** 
## Horsepower   -0.02000    0.00408   -4.90  1.8e-06 ***
## Width        -0.35358    0.06879   -5.14  6.1e-07 ***
## Seating      -0.24135    0.14955   -1.61    0.108    
## Cylinders    -0.27169    0.23292   -1.17    0.245    
## Displacement -0.93813    0.37166   -2.52    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.04 on 219 degrees of freedom
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.697 
## F-statistic: 87.2 on 6 and 219 DF,  p-value: <2e-16
# shift+c+command: to comment the portion selected 
# confint(fit3)
# 2*pnorm(2.84, lower.tail = FALSE) # pvalues
# newcar
# predict(fit3, newcar)
# RSS <- sum((fit3$residual)^2)
# MSE <- RSS/219
# RSE <- sqrt(MSE)   

Based on the output from model 3:

\(\hat\beta_{Length}\) = 0.049 for model 3. Is this estimate wrong?

To summarize, the effects of Length from the above three models are

data.frame(Model.1 = coef(fit1)[2], Model.2 = coef(fit2)[2], Model.3 = coef(fit3)[2])
##        Model.1 Model.2 Model.3
## Length   -0.14 -0.0619  0.0491

# process the lm output into a table
stargazer(fit1, fit2, fit3, type = output_format, 
                     # ci=TRUE, ci.level = .95,
                     keep.stat = c("n", "rsq", "sigma2", "ser"))
========================================================================
                                    Dependent variable: MPG_City
                        ------------------------------------------------
                              (1)              (2)              (3)
------------------------------------------------------------------------
Length                     -0.140***        -0.062***         0.049***
                            (0.016)          (0.013)          (0.017)
Horsepower                                  -0.037***        -0.020***
                                             (0.003)          (0.004)
Width                                                        -0.354***
                                                              (0.069)
Seating                                                       -0.241
                                                              (0.150)
Cylinders                                                     -0.272
                                                              (0.233)
Displacement                                                 -0.938**
                                                              (0.372)
Constant                   45.300***        38.600***        45.600***
                            (2.900)          (2.230)          (4.030)
------------------------------------------------------------------------
Observations                  226              226              226
R2                           0.266            0.591            0.705
Residual Std. Error    3.170 (df = 224) 2.380 (df = 223) 2.040 (df = 219)
========================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01
  • They are different as expected
  • Each one has its own meaning!

2.6 Inference for a fixed model

Given a fixed model with several predictors, say model 3, assume all the linear model assumptions are met. Let us walk through the possible outcomes and point out the limitations.

Fit the model:

fit3 <- lm(MPG_City ~ Length + Horsepower + Width + Seating +
           Cylinders + Displacement, data = data1)
summary(fit3) ### key output
## 
## Call:
## lm(formula = MPG_City ~ Length + Horsepower + Width + Seating + 
##     Cylinders + Displacement, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.877 -1.462  0.073  1.149  8.261 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.63372    4.02793   11.33  < 2e-16 ***
## Length        0.04909    0.01730    2.84    0.005 ** 
## Horsepower   -0.02000    0.00408   -4.90  1.8e-06 ***
## Width        -0.35358    0.06879   -5.14  6.1e-07 ***
## Seating      -0.24135    0.14955   -1.61    0.108    
## Cylinders    -0.27169    0.23292   -1.17    0.245    
## Displacement -0.93813    0.37166   -2.52    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.04 on 219 degrees of freedom
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.697 
## F-statistic: 87.2 on 6 and 219 DF,  p-value: <2e-16
# shift+c+command: to comment the portion selected 
# confint(fit3)
# 2*pnorm(2.84, lower.tail = FALSE) # pvalues
# newcar
# predict(fit3, newcar)
# RSS <- sum((fit3$residual)^2)
# MSE <- RSS/219
# RSE <- sqrt(MSE)
## clean summary table and arrange it by tstat
fit3.summary <- summary(fit3)   # names(summary(fit3))
data.frame(fit3.summary %>% coefficients) %>%
  rename(estimate = Estimate,
         se = "Std..Error",
         tstat = "t.value",
         pvalue = "Pr...t..") %>%
  mutate(abs.tstat = abs(tstat)) %>%
  arrange(desc(abs.tstat)) %>%
  knitr::kable()

Questions of interest:

Students please fill in the answers with output from R.

  1. Write down the final OLS equation of MPG_City given the rest of the predictors.

  2. What is the standard error from the output? Precisely what does it measure?

  3. Interpret the \(R^2\) reported for this model. Do you feel comfortable using the output for the following questions based on this \(R^2\) value?

  4. What does each t-interval and t-test do?

  5. Can you conclude that with \(95\%\) confidence all the t confidence intervals provided above are right simultaneously?

  6. Is Width THE most important variable, HP the second, etc., since they have the smallest p-values?

  7. Can we conclude that neither Seating nor Cylinders is needed?

  8. Is Width the most useful variable because its coefficient is the largest in magnitude?

2.7 Confidence Interval for the Mean

Based on Model 3, the mean MPG_City among all cars with the same features as the new design (Length = 180, HP = 225, Width = 80, Seating = 5, Cylinders = 4, Displacement = 3.5, Transmission = “automatic”, Continent = “Am”) is

\[\hat{y} = 45.63 + 0.05 \times 180 - 0.02 \times 225 - 0.35 \times 80 - 0.24 \times 5 - 0.27 \times 4 - 0.94 \times 3.5 \approx 16.11 \] (using the full-precision coefficients), with a standard error for the estimated mean of 0.784.

We will get the prediction, standard error and confidence interval via predict().

predict(fit3, newcar, interval = "confidence", se.fit = TRUE) 
## $fit
##    fit  lwr  upr
## 1 16.1 14.6 17.7
## 
## $se.fit
## [1] 0.784
## 
## $df
## [1] 219
## 
## $residual.scale
## [1] 2.04

Q: What assumptions are needed to make this a valid confidence interval?

2.8 Prediction Interval

Based on Model 3, the predicted MPG_City for this particular new design is

\[\hat{y} = 45.63 + 0.05 \times 180 - 0.02 \times 225 - 0.35 \times 80 - 0.24 \times 5 - 0.27 \times 4 - 0.94 \times 3.5 \approx 16.11\] with a 95% prediction interval of approximately

\[\hat{y} \pm 2\times RSE = 16.11 \pm 2 \times 2.036.\]

The prediction interval can be obtained by predict().

predict(fit3, newcar,  interval = "predict", se.fit = TRUE) # future prediction intervals
## $fit
##    fit  lwr  upr
## 1 16.1 11.8 20.4
## 
## $se.fit
## [1] 0.784
## 
## $df
## [1] 219
## 
## $residual.scale
## [1] 2.04

Q: What assumptions are needed to make this a valid prediction interval?

2.9 Model Diagnoses

To check whether the model assumptions are met, we examine the residual plot and the qqplot of the residuals.

We use the first and second plots of plot(fit).

par(mfrow=c(1,2), mar=c(5,2,4,2), mgp=c(3,0.5,0)) # plot(fit3) produces several plots
plot(fit3, 1, pch=16) # residual plot
abline(h=0, col="blue", lwd=2)
plot(fit3, 2) # qqplot

Are the linear model assumptions met for the model fit here (fit3)? What might be violated?

  • Linearity?
  • Equal variances?
  • Normality?

3 F-test

With multiple predictors, we often need to test whether a set of \(\beta\)'s are all 0. For example, we may ask whether Seating and Cylinders are useful at all. Mathematically, we test the null hypothesis

\[H_0: \beta_\text{Seating}=\beta_\text{Cylinder}=0 \mbox{ vs. } H_1: \text{at least one of } \beta_\text{Seating},\beta_\text{Cylinder} \text{ is non-zero}\]

To put it in another way, \(H_0\) is the reduced model:

\[y_i = \beta_0 + \beta_{\text{Length}} x_{\text{i,Length}} + \beta_{\text{HP}}x_{\text{i,HP}} + \beta_{\text{Width}}x_{\text{i, Width}} + \beta_{\text{Displacement}}x_{\text{i, Displacement}} + \epsilon_i\]

and \(H_1\) is the full model:

\[y_i = \beta_0 + \beta_{\text{Length}} x_{\text{i,Length}} + \beta_{\text{HP}}x_{\text{i,HP}} + \beta_{\text{Width}}x_{\text{i, Width}} + \\ \beta_{\text{Seating}}x_{\text{i, Seating}} + \beta_{\text{Cylinders}}x_{\text{i, Cylinders}} + \beta_{\text{Displacement}}x_{\text{i, Displacement}} + \epsilon_i\]

Theorem: Assume the linear model assumptions hold, then under the \(H_0\)

\[F_{\text{stat}}:=\frac{\frac{RSS(H_0)-RSS(H_1)}{df_1}}{\frac{RSS(H_1)}{df_2}} \sim F_{df_1, df_2}\] where \[\begin{split} {df}_1 &= \text{\# of parameters in } H_1 - \text{\# of parameters in } H_0 \\ {df}_2 &= \text{\# of observations} - \text{\# of parameters in } H_1 \end{split}.\]

The p-value is

\[\text{p-value}=P(F_{{df}_1, {df}_2} > F_{\text{stat}})\]

To explore, we can run a model without Seating and Cylinders.

fit3.0 <- lm(MPG_City ~ Horsepower + Length + Width + Displacement, data = data1) # under H0
sum((summary(fit3.0)$residuals)^2) # RSS(H_0)
sum((summary(fit3)$residuals)^2) # RSS(H_1)

\[F_{\text{stat}}=\frac{\frac{RSS(H_0)-RSS(H_1)}{df_1}}{\frac{RSS(H_1)}{df_2}} = \frac{\frac{923.82 - 907.76}{2}}{\frac{907.76}{219}}=1.9373.\]

Based on the \(F\)-test given above, the \(p\)-value is \[P(F_{2,219} > 1.9373)\]

pf(1.9373, 2, 219, lower.tail=F)  
## [1] 0.147
# hist(rf(10000, 2, 219), breaks=40)    # take a look at the F dis

3.1 anova()

We can use the anova() function to carry out the above F-test: anova(fit under \(H_0\), fit under \(H_1\)).

fit3.0 <- lm(MPG_City ~ Horsepower + Length + Width + Displacement, data = data1)
anova(fit3.0, fit3)
## Analysis of Variance Table
## 
## Model 1: MPG_City ~ Horsepower + Length + Width + Displacement
## Model 2: MPG_City ~ Length + Horsepower + Width + Seating + Cylinders + 
##     Displacement
##   Res.Df RSS Df Sum of Sq    F Pr(>F)
## 1    221 924                         
## 2    219 908  2      16.1 1.94   0.15

Conclusion: The p-value of .1466 gives us no evidence to reject the null hypothesis.

Questions based on the summary for fit3

  1. What is the \(F\) statistic from summary(fit3)?

  2. Can you perform a \(F\)-test for each \(\beta\) being 0? What is the relationship between the F-test and the t-test?

TSS <- sum((data1$MPG_City - mean(data1$MPG_City))^2)   # (226-1)*var(data1$MPG_City)
TSS
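A quick sketch for question 2, assuming fit3 from above: dropping a single predictor and comparing the two models with anova() gives an \(F\) statistic equal to the square of that predictor's \(t\) statistic (with the same p-value).

fit3.drop.length <- update(fit3, . ~ . - Length)     # H_0: beta_Length = 0
anova(fit3.drop.length, fit3)                        # F-test for Length alone
summary(fit3)$coefficients["Length", "t value"]^2    # equals the F statistic above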

Roughly speaking, the diagnostic plots suggest that the linearity, homoscedasticity and normality assumptions are satisfied. (Very subjective!)

4 Categorical Predictors

Let’s use Continent as one variable. It has three categories. We explore the following questions:

  1. Are Asian cars more efficient?
  2. How does Continent affect the MPG?
levels(data1$Continent)   #data1$Continent
## [1] "Am" "As" "E"

A model with a categorical predictor is termed a “one-way ANOVA”.

First, we get the sample mean, standard deviation and size of each group, then display side-by-side boxplots:

data1 %>%
group_by(Continent) %>%
  summarise(
    mean = mean(MPG_City),
    sd  = sd(MPG_City),
    n = n()
  )
## # A tibble: 3 x 4
##   Continent  mean    sd     n
## * <fct>     <dbl> <dbl> <int>
## 1 Am         18.7  2.99    74
## 2 As         20.2  4.20    98
## 3 E          18.3  3.22    54
ggplot(data1) + geom_boxplot(aes(x = Continent, y = MPG_City))

We use indicator predictors to analyze the effect of Continent.

4.1 Model 1 - One Way ANOVA

\[y_{i|x=Am} = \mu_{Am} + \epsilon_i\] \[y_{i|x=As} = \mu_{As} + \epsilon_i\] \[y_{i|x=E} = \mu_{E} + \epsilon_i\] where \(\mu_{Am}\), \(\mu_{As}\) and \(\mu_{E}\) are the mean MPG of each continent and \(\epsilon_i \sim \mathcal{N}(0,\sigma^2)\). We want to compare the three means, and this can be done through a linear model with indicator variables.

We are interested in testing the hypotheses:

\[H_0: \mu_{Am} = \mu_{As} = \mu_{E} \quad \text{vs.} \quad H_1: \text{not all three means are equal}\]

Let

  • Am be the base (reference) level

  • \(x_1\) be the indicator function of being As, i.e. \(x_1 = I\{Continent = As\}\)

  • \(x_2\) be the indicator function of being E, i.e. \(x_2 = I\{Continent = E\}\)

Then the above ANOVA model is the same as

\[y_i = \beta_{Am} + \beta_{As} x_{1,i} + \beta_E x_{2,i} + \epsilon_i = \beta_{Am} + \beta_{As} I\{Continent = As\} + \beta_E I\{Continent = E\} + \epsilon_i\] where

  • \(\beta_{Am} = \textbf{E}(MPG|Am)\)

  • \(\beta_{As} = \textbf{E}(MPG|As)-\textbf{E}(MPG|Am)\): the increment between Asia and America

  • \(\beta_E = \textbf{E}(MPG|E)-\textbf{E}(MPG|Am)\): the increment between Europe and America

To test \[H_0: \mu_{Am} = \mu_{As} = \mu_{E}\] is the same as to test \[H_0: \beta_{As} = \beta_{E} = 0\]

Now we can use lm() to fit the model and carry out the tests

fit.continent <- lm(MPG_City ~ Continent, data1)
summary(fit.continent)  #confint(fit.continent); Anova(fit.continent); anova(fit.continent)
## 
## Call:
## lm(formula = MPG_City ~ Continent, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.259 -2.730 -0.245  1.755 12.755 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.730      0.420   44.62   <2e-16 ***
## ContinentAs    1.515      0.556    2.72   0.0069 ** 
## ContinentE    -0.470      0.646   -0.73   0.4673    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.61 on 223 degrees of freedom
## Multiple R-squared:  0.0552, Adjusted R-squared:  0.0467 
## F-statistic: 6.52 on 2 and 223 DF,  p-value: 0.00178

The \(F\)-test here tests \(H_0: \; \beta_{As} = \beta_{E} = 0\), i.e., that there is no difference among the three regions. We reject \(H_0\) at \(\alpha=0.01\) and conclude that the mean MPG’s differ among the three regions.

We can also see the coding from the design matrix below: Am is the base, and each coefficient captures the effect of its level vs. Am.

model.matrix(fit.continent)[5:10, ]

Remarks: an equivalent fit (reparametrized, with As as the base level) can also be obtained from

fit.continent.1 <- lm(MPG_City ~ I(Continent == "Am") + I(Continent == "E") , data1)
summary(fit.continent.1)    
model.matrix(fit.continent.1)[5:10, ]

We also show a model without intercept \(\beta_0\). Can you interpret the summary table below?

fit.continent.2 <- lm(MPG_City ~ 0 + Continent , data1) #
summary(fit.continent.2)  
## 
## Call:
## lm(formula = MPG_City ~ 0 + Continent, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.259 -2.730 -0.245  1.755 12.755 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## ContinentAm   18.730      0.420    44.6   <2e-16 ***
## ContinentAs   20.245      0.365    55.5   <2e-16 ***
## ContinentE    18.259      0.491    37.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.61 on 223 degrees of freedom
## Multiple R-squared:  0.967,  Adjusted R-squared:  0.966 
## F-statistic: 2.15e+03 on 3 and 223 DF,  p-value: <2e-16
model.matrix(fit.continent.2)[5:10, ] # try to show you the indicator variables created
##    ContinentAm ContinentAs ContinentE
## 5            0           1          0
## 6            0           1          0
## 7            0           0          1
## 8            0           0          1
## 9            0           0          1
## 10           0           0          1

Questions from the above fit:

  • What does each t-test do?
  • What does \(R^2\) mean?
  • What does the F test do here?

contrast()

To estimate the mean difference between two regions, we can use the contrast() function from package contrast.

contrast(fit.continent, list(Continent = 'As'), list(Continent = 'Am'))
contrast(fit.continent, list(Continent = 'As'), list(Continent = 'E'))  # comp As with E

Changing the base level

We can choose the base level as we want. We demonstrate this by setting European cars as the base category:

data1$Continent <- factor(data1$Continent, levels = c("E", "Am", "As"))  
summary(lm(MPG_City ~ Continent, data1))
## 
## Call:
## lm(formula = MPG_City ~ Continent, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.259 -2.730 -0.245  1.755 12.755 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.259      0.491   37.16   <2e-16 ***
## ContinentAm    0.470      0.646    0.73   0.4673    
## ContinentAs    1.986      0.612    3.24   0.0014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.61 on 223 degrees of freedom
## Multiple R-squared:  0.0552, Adjusted R-squared:  0.0467 
## F-statistic: 6.52 on 2 and 223 DF,  p-value: 0.00178

Note that:

  1. The \(F\)-test stays the same.
  2. The coefficients are different, but the estimated group means are the same.
# Change it back
data1$Continent <- factor(data1$Continent, levels = c("Am", "As", "E")) 
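As an aside (an equivalent approach, not used elsewhere in this lecture), base R's relevel() sets the reference level directly inside the formula without permanently reordering the factor:

# Same fit as above with E as the base, without modifying data1
summary(lm(MPG_City ~ relevel(Continent, ref = "E"), data1))$coefficients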

Drawback of one-way ANOVA: other factors should also be taken into consideration.

4.2 Model 2 - Model without Interaction

Let us add one important variable, Horsepower, in addition to Continent. We first show three scatter plots by region, each with its lm line added. Though the slopes look somewhat different, for the sake of simple interpretation we first fit the usual additive model without interaction; that is, we assume the effect of HP on MPG_City is the same in every region.

ggplot(data1, aes(x = Horsepower, y = MPG_City, color=Continent)) + 
  geom_point() + 
  geom_smooth(method = "lm" , se =F) + 
  facet_wrap(~ Continent) + labs(title = "City MPG by Horsepower")
## `geom_smooth()` using formula 'y ~ x'

# ggplot(data1, aes(x = Horsepower, y = MPG_City, color = Continent)) +
#   geom_point() +
#   geom_smooth(method = "lm" , se =F) +
#   labs(title = "Model With Interactions")

We next plot our models without interactions.

fit.no.interaction <- lm(MPG_City ~ Continent + Horsepower, data1)
coefs <- summary(fit.no.interaction)$coefficients[, 1]

ggplot(data1, aes(x = Horsepower, y = MPG_City, color = Continent)) + 
  geom_point() + 
  geom_abline(intercept =coefs[1], slope=coefs[4], color ="#F8766D") + 
  geom_abline(intercept =coefs[1]+coefs[2], slope=coefs[4], color = "#00BA38") +
  geom_abline(intercept =coefs[1]+coefs[3], slope=coefs[4], color ="#619CFF") +
  labs(title = "Model Without Interactions")

For a model without interaction, we assume that the effect of HP on MPG_City is the same regardless of the region in which the cars are produced.

\[MPG = \beta_{Am} + \beta_{As} \cdot I(Continent = As) + \beta_E \cdot I(Continent = E) + \beta_1 \cdot HP + \epsilon\]

fit.no.interaction <- lm(MPG_City ~ Horsepower + Continent, data1)
summary(fit.no.interaction)
## 
## Call:
## lm(formula = MPG_City ~ Horsepower + Continent, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.948 -1.486  0.053  1.476  8.863 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.64616    0.62290   44.38   <2e-16 ***
## Horsepower  -0.04212    0.00261  -16.11   <2e-16 ***
## ContinentAs  1.03893    0.37958    2.74   0.0067 ** 
## ContinentE   0.44071    0.44341    0.99   0.3213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.46 on 222 degrees of freedom
## Multiple R-squared:  0.564,  Adjusted R-squared:  0.558 
## F-statistic: 95.9 on 3 and 222 DF,  p-value: <2e-16

To test whether Continent is significant after controlling for Horsepower, we can use anova(H_0, H_1).

anova(lm(MPG_City ~ Horsepower, data1), fit.no.interaction)
## Analysis of Variance Table
## 
## Model 1: MPG_City ~ Horsepower
## Model 2: MPG_City ~ Horsepower + Continent
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)  
## 1    224 1386                           
## 2    222 1340  2        46 3.81  0.024 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.0237: weak evidence against the null hypothesis that there is no Continent effect.

4.2.1 Anova()

We can also use Anova() from the car package.

Anova(fit.no.interaction)   # Note: anova(fit.no.interaction) would NOT give the right F-test here! anova() is a sequential test: given the variables above, do we need the current one?
## Anova Table (Type II tests)
## 
## Response: MPG_City
##            Sum Sq  Df F value Pr(>F)    
## Horsepower   1567   1  259.47 <2e-16 ***
## Continent      46   2    3.81  0.024 *  
## Residuals    1340 222                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We fail to reject \(H_0\) at \(\alpha=0.01\); i.e., controlling for HP, we do not have strong evidence that Continent has an effect.

Remark: models without interactions are simple, and we can test the Continent effect directly. But we should ask ourselves whether this assumption is reasonable.

4.3 Model 3 - Model with Interaction

It is possible that HP effects depend on Continent.

ggplot(data1, aes(x = Horsepower, y = MPG_City, color = Continent)) + 
  geom_point() + 
  geom_smooth(method = "lm" , se =F) + 
  labs(title = "Model With Interactions")
## `geom_smooth()` using formula 'y ~ x'

As we can see from the plot, Asian cars are more efficient at smaller HP, while European cars are more efficient at larger HP. To capture this effect, we introduce a model with interactions: MPG ~ Horsepower by Continent.

\[y_{i|Am,\, HP} = \beta_{Am} + \beta_{1_{Am}} \cdot HP + \epsilon_i\] \[y_{i|As,\, HP} = \beta_{As} + \beta_{1_{As}} \cdot HP + \epsilon_i\]

\[y_{i|E,\, HP} = \beta_{E} + \beta_{1_{E}} \cdot HP + \epsilon_i\]

This can be done through the following indicators:

\[MPG= \beta_{Am} + \beta_{As} \cdot I(Continent = As) + \beta_E \cdot I(Continent = E) + \\ \beta_{1_{Am}} \cdot HP + \beta_{1_{As}} \cdot HP \cdot I(Continent = As) + \beta_{1_E} \cdot HP \cdot I(Continent = E) + \epsilon\]

Similarly,

  • \(\beta_{As}\) is the increment of intercept between Asian and American, …

  • \(\beta_{1_{As}}\) is the increment of slope between Asian and American, …

Use lm() with interaction:

fit.with.interaction <- lm(MPG_City ~ Horsepower*Continent, data1)
summary(fit.with.interaction)    #model.matrix(fit.with.interaction)[1:3,]
## 
## Call:
## lm(formula = MPG_City ~ Horsepower * Continent, data = data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.792 -1.596 -0.021  1.610  7.794 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            26.60189    1.04938   25.35  < 2e-16 ***
## Horsepower             -0.03718    0.00478   -7.78  2.8e-13 ***
## ContinentAs             4.40301    1.35269    3.25   0.0013 ** 
## ContinentE             -0.73505    1.51377   -0.49   0.6278    
## Horsepower:ContinentAs -0.01651    0.00629   -2.63   0.0092 ** 
## Horsepower:ContinentE   0.00458    0.00654    0.70   0.4842    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.4 on 220 degrees of freedom
## Multiple R-squared:  0.59,   Adjusted R-squared:  0.58 
## F-statistic: 63.3 on 5 and 220 DF,  p-value: <2e-16
#summary(lm(MPG_City ~ Horsepower + Continent + Horsepower*Continent, data1)) # this is same as fit.with.interaction

Question: Is the interaction effect significant? Are the effects of HP the same across Continent?

To test for no interaction is the same as to test \[H_0: \beta_{1_{As}} = \beta_{1_E} = 0,\] which we can do with anova():

anova(fit.no.interaction, fit.with.interaction)
## Analysis of Variance Table
## 
## Model 1: MPG_City ~ Horsepower + Continent
## Model 2: MPG_City ~ Horsepower * Continent
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)   
## 1    222 1340                            
## 2    220 1262  2      78.3 6.82 0.0013 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(fit.with.interaction)
## Anova Table (Type II tests)
## 
## Response: MPG_City
##                      Sum Sq  Df F value Pr(>F)    
## Horsepower             1567   1  273.07 <2e-16 ***
## Continent                46   2    4.01 0.0196 *  
## Horsepower:Continent     78   2    6.82 0.0013 ** 
## Residuals              1262 220                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We do reject the null hypothesis, with a \(p\)-value of \(.0013\), small though not extreme. That indicates some interaction effect between HP and Continent.

So the final conclusion would be

  • Asian cars with smaller HP are more efficient;
  • European cars with larger HP are more efficient.

The interpretation of the model is no longer neat.

5 A more flexible model

We now try a model which includes all sensible variables.

names(data1)
names.exclude <- names(data1) %in% c("Make.Model", "MPG_Hwy", "Make")
# names.exclude <- c("Make.Model", "MPG_Hwy", "Make") # doesn't work.
data2 <- data1[!names.exclude]  # Take a subset with all var's but "Make.Model","MPG_Hwy","Make"
names(data2)    #str(data2)   table(data2$Transmission)

Or use dplyr

data2 <- select(data1, -Make.Model, -MPG_Hwy, -Make)
names(data2)
##  [1] "Continent"    "MPG_City"     "Horsepower"   "Weight"      
##  [5] "Length"       "Width"        "Seating"      "Cylinders"   
##  [9] "Displacement" "Transmission"

QUESTION: What would have happened if you had added Make.Model into the model???

fit.all <- lm(MPG_City ~., data2)    #convenient way to include all variables
summary(fit.all)
## 
## Call:
## lm(formula = MPG_City ~ ., data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.275 -1.024  0.038  0.913  6.933 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               29.54629    3.78785    7.80  2.7e-13 ***
## ContinentAs                0.46536    0.29121    1.60  0.11151    
## ContinentE                 0.05009    0.42770    0.12  0.90688    
## Horsepower                -0.01331    0.00389   -3.42  0.00075 ***
## Weight                    -3.50280    0.38932   -9.00  < 2e-16 ***
## Length                     0.03313    0.01492    2.22  0.02737 *  
## Width                     -0.01212    0.06742   -0.18  0.85749    
## Seating                    0.31648    0.14234    2.22  0.02724 *  
## Cylinders                 -0.19980    0.20535   -0.97  0.33166    
## Displacement              -0.16762    0.39266   -0.43  0.66988    
## Transmissioncont_variable  1.22163    0.99566    1.23  0.22119    
## Transmissionmanual        -0.24546    0.55587   -0.44  0.65924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.68 on 214 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.793 
## F-statistic: 79.2 on 11 and 214 DF,  p-value: <2e-16
data3 <- data2
data3$Cylinders <- as.factor(data3$Cylinders)
Anova(lm(MPG_City ~., data3) )   # interestingly, Cylinders is very significant as a categorical variable.
## Anova Table (Type II tests)
## 
## Response: MPG_City
##              Sum Sq  Df F value  Pr(>F)    
## Continent        10   2    2.13   0.121    
## Horsepower       15   1    6.42   0.012 *  
## Weight          212   1   90.03 < 2e-16 ***
## Length           13   1    5.35   0.022 *  
## Width             1   1    0.30   0.584    
## Seating          13   1    5.67   0.018 *  
## Cylinders       113   4   12.07 7.5e-09 ***
## Displacement      6   1    2.67   0.104    
## Transmission      8   2    1.68   0.189    
## Residuals       496 211                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The \(t\)-table is pretty messy and many variables are not significant; we need Anova() to test each variable (all of its levels at once) given the others.

Anova(fit.all)
## Anova Table (Type II tests)
## 
## Response: MPG_City
##              Sum Sq  Df F value  Pr(>F)    
## Continent        10   2    1.79 0.16972    
## Horsepower       33   1   11.71 0.00075 ***
## Weight          229   1   80.95 < 2e-16 ***
## Length           14   1    4.93 0.02737 *  
## Width             0   1    0.03 0.85749    
## Seating          14   1    4.94 0.02724 *  
## Cylinders         3   1    0.95 0.33166    
## Displacement      1   1    0.18 0.66988    
## Transmission      5   2    0.89 0.41282    
## Residuals       606 214                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comments:

  1. Continent is not needed after accounting for all other variables. That says we don’t have evidence to show that one continent tends to produce more efficient cars than the others.
  2. Can we drop all the variables with \(p\)-values larger than \(.05\)?
  3. Take a look at Transmission. The cars are predominantly automatic. Perhaps we should only include automatic cars and change our population.
table(data2$Transmission)
## 
##     automatic cont_variable        manual 
##           211             3            12
  4. We don’t have to use a model with all \(p\)-values being small!

Model diagnosis: check whether the three assumptions of linear models are met.

Residuals vs. fitted and normal Q-Q plots; both look fine.

par(mfrow=c(1,2))
plot(fit.all, 1) 
plot(fit.all, 2) 

We now provide a confidence interval for the mean of cars like our newcar:

fit.mean.new <- predict(fit.all, newcar, interval = "confidence") 
fit.mean.new
##    fit  lwr  upr
## 1 17.7 16.3 19.1

We also provide a prediction interval for this newcar:

fit.new <- predict(fit.all, newcar, interval = "prediction", se.fit=TRUE)
fit.new$fit
##    fit  lwr  upr
## 1 17.7 14.1 21.3
fit.new$se.fit
## [1] 0.702
fit.new$residual.scale
## [1] 1.68

6 Summary

Summary about linear models:

  1. Linear models are simple and have nice interpretations

  2. The coefficient of each predictor depends on the rest of the model

  3. LS automatically adjusts for the units of the predictors

  4. \(t\)-tests: the effect of each coefficient. (They do not rank the importance of the predictors!)

  5. \(F\)-tests: test a set of predictors (\(t\)-tests are special cases!)

  6. Model assumptions: all in residuals

  7. The most damaging model violation is the presence of heteroscedasticity. It threatens the validity of each test. But we can use sandwich estimators to get the standard errors. See section Heteroscedasticity in the appendix.

  8. Last but not least: linear models used here only show associations between predictors and the response on average. We CANNOT conclude causality at all.

BIGGEST QUESTION: How to identify a set of important variables or which model to use?????

Choosing models with small “prediction errors” is one way to go. We will discuss this further in the lecture on model selection.

7 Appendices

7.1 Appendix 1: Clean dataset

We show how we cleaned the dataset Cars_04.csv into car_04_regular.csv.

# Caution: you need to have Cars_04.csv in order to run this chunk!!!!!
data = read.csv("data/Cars_04.csv", header=T, as.is=F)
str(data)   # There are 242 cars in the data
data.comp <- na.omit(data)  # It deletes any car with a missing value - not a good idea.
str(data.comp)  #182 cars with complete records for all the variables. 

## Explore the data
names(data)
head(data)
summary(data)

par(mgp=c(1.8,.5,0), mar=c(3,3,2,1)) 
hist(data$Horsepower, breaks=20, col="blue")  # notice that there are some super powerful cars

## Let's find out which are the super powerful cars
data$Make.Model[data$Horsepower > 400] # Show models with Horsepower > 400
data$Make.Model[data$Horsepower > 390]
data$Make.Model[data$Horsepower <= 100]

## We could find cars with horsepower > 400
data[data$Horsepower > 400, "Make.Model"] 

## Let's concentrate on  "regular" cars and exclude those super expensive ones 
datacar <- data[(data$Horsepower <=390) & (data$Horsepower > 100), ]

## Take a subset with relevant variables
variables <-  c("Make.Model", "Continent", "MPG_City", "MPG_Hwy",
              "Horsepower","Weight","Length", "Width",
              "Seating","Cylinders","Displacement",
              "Make", "Transmission")

data1 <- datacar[, variables]  # subset
str(data1)
write.csv(data1, file="car_04_regular.csv", row.names = FALSE)
                            # Output the data and name it car_04_regular.csv

7.2 Appendix 2: Using formulae to obtain OLS estimates

Take fit3 and recreate \(\hat\beta\)’s.

design.x <-  model.matrix(fit3)  #fit3$model gives us Y and x1 x2...xp 
y <- fit3$model[, 1]
design.x[, 1]
rse <- (summary(fit3))$sigma
beta <-  (solve( t(design.x) %*% design.x)) %*% t(design.x) %*% y # reconstruct LS estimates
beta

Reconstructing the covariance matrix from fit3

summary(fit3)$cov.unscaled   # inverse (X' X)
cov.beta <- (rse^2) * (summary(fit3)$cov.unscaled)  # alternatively we can get the cov matrix this way

Calculating the Standard Error from fit3

sd.beta <- sqrt(diag(cov.beta)) # check to see this agrees with the sd for each betahat
sd.beta  # this should be the same as the Std. Error column in the summary table

summary(fit3)$coeff[, "Std. Error" ]

7.3 Appendix 3: Heteroscedasticity

Sandwich estimator:

In a linear model, when \(\text{Var}(y_i| x_i)=\sigma^2(x_i)\), i.e., the linear model is heteroscedastic, the least squares estimator of \(\beta\) is still unbiased, but the usual variance formula no longer holds: \(\text{Var}( {\bf\hat{\beta}}) \neq \sigma^2 (X^T X)^{-1}\). It has long been known in econometrics that we can correct the variance estimator by a sandwich estimator. We have \[\text{Var}( {\bf\hat{\beta}}) = (X^T X)^{-1}(X^T D X) (X^T X)^{-1}\] where \(D\) is a diagonal matrix with entries \(d_i=\sigma^2_i=\text{Var}(y_i|x_i)\). We of course do not know the \(d_i\), so we use the squared residuals to estimate them: \(\hat d_i = (y_i-\hat y_i)^2\). That yields the famous sandwich estimator due to White (1980): \[\widehat{\text{Var}}( {\bf\hat{\beta}}) = (X^T X)^{-1}(X^T \hat D X) (X^T X)^{-1}\] This correction has been implemented in various packages. We use sandwich with the function lmtest::coeftest(fm, df = Inf, vcov = vcovHC). HC stands for Heteroscedasticity-Consistent covariance matrix estimation.

To produce HC-corrected inference for \(\beta\), we first get a linear fit, fit = lm(y ~ ...), and then apply lmtest::coeftest(fit, df = Inf, vcov = vcovHC) to get the corrected standard errors and tests (lmtest::coefci gives the corresponding confidence intervals).

Remark: Even when linearity does not hold, we can still fit a linear model and interpret the fitted linear function as an approximation to the mean. The confidence intervals obtained using sandwich estimators remain valid. See our recent work Models as Approximations I: Consequences Illustrated with Linear Regression.

Let us revisit the final model with all variables:

data2 <- select(data1, -Make.Model, -MPG_Hwy, -Make)
fit.all <- lm(MPG_City ~., data2) 
Anova(fit.all)
## Anova Table (Type II tests)
## 
## Response: MPG_City
##              Sum Sq  Df F value  Pr(>F)    
## Continent        10   2    1.79 0.16972    
## Horsepower       33   1   11.71 0.00075 ***
## Weight          229   1   80.95 < 2e-16 ***
## Length           14   1    4.93 0.02737 *  
## Width             0   1    0.03 0.85749    
## Seating          14   1    4.94 0.02724 *  
## Cylinders         3   1    0.95 0.33166    
## Displacement      1   1    0.18 0.66988    
## Transmission      5   2    0.89 0.41282    
## Residuals       606 214                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit.all, 1)

From the residual plot, one may worry about a violation of linearity, as well as the presence of heteroscedasticity.

To provide a more trustworthy analysis, we may use sandwich standard errors for more accurate inference on the coefficients.

#sandwich(fit.all)  # gives us HC cov estimator
#vcovHC(fit.all) # also gives us the HC cov matrix
fit.all.hc <- lmtest::coeftest(fit.all, df = Inf, vcov = vcovHC)
fit.all.hc  # output
## 
## z test of coefficients:
## 
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               29.54629    3.62610    8.15  3.7e-16 ***
## ContinentAs                0.46536    0.26515    1.76   0.0793 .  
## ContinentE                 0.05009    0.41556    0.12   0.9041    
## Horsepower                -0.01331    0.00483   -2.76   0.0058 ** 
## Weight                    -3.50280    0.39419   -8.89  < 2e-16 ***
## Length                     0.03313    0.01767    1.87   0.0608 .  
## Width                     -0.01212    0.05920   -0.20   0.8378    
## Seating                    0.31648    0.15399    2.06   0.0399 *  
## Cylinders                 -0.19980    0.23428   -0.85   0.3938    
## Displacement              -0.16762    0.40519   -0.41   0.6791    
## Transmissioncont_variable  1.22163    0.62256    1.96   0.0497 *  
## Transmissionmanual        -0.24546    0.87440   -0.28   0.7789    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us compare the regular lm output with the one using the sandwich covariance estimate:

stargazer(fit.all, fit.all.hc, type = output_format ) 
=============================================================
                                Dependent variable: MPG_City
                            ---------------------------------
                                  OLS        coefficient test
                                  (1)              (2)
-------------------------------------------------------------
ContinentAs                      0.465           0.465*
                                (0.291)          (0.265)
ContinentE                       0.050            0.050
                                (0.428)          (0.416)
Horsepower                     -0.013***        -0.013***
                                (0.004)          (0.005)
Weight                         -3.500***        -3.500***
                                (0.389)          (0.394)
Length                          0.033**          0.033*
                                (0.015)          (0.018)
Width                           -0.012           -0.012
                                (0.067)          (0.059)
Seating                         0.316**          0.316**
                                (0.142)          (0.154)
Cylinders                       -0.200           -0.200
                                (0.205)          (0.234)
Displacement                    -0.168           -0.168
                                (0.393)          (0.405)
Transmissioncont_variable        1.220           1.220**
                                (0.996)          (0.623)
Transmissionmanual              -0.245           -0.245
                                (0.556)          (0.874)
Constant                       29.500***        29.500***
                                (3.790)          (3.630)
-------------------------------------------------------------
Observations                      226
R2                               0.803
Adjusted R2                      0.793
Residual Std. Error         1.680 (df = 214)
F Statistic              79.200*** (df = 11; 214)
=============================================================
Note:                       *p<0.1; **p<0.05; ***p<0.01

While the sandwich standard errors remain more or less the same as before, we do notice one difference: Asian cars are more efficient than American cars at \(\alpha = .1\).

The estimators for each coefficient remain the same!

cov <- vcovHC(fit.all, type = "HC")
robust.se <- sqrt(diag(cov))
stargazer(fit.all, fit.all.hc, type = output_format,
          se = list(NULL, robust.se)) 

Remark: How do we incorporate heteroscedasticity into prediction intervals? One easy way out is to fit another model taking the squared residuals as the response, and use its predicted values as the variance given \(x\). We could also use a more relaxed non-linear model such as trees/random forests.
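A rough sketch of this first idea, assuming fit.all, data2 and newcar from above (the variance model below is our illustration, not a standard recipe):

# Step 1: model the squared residuals as a function of the predictors
var.fit <- lm(residuals(fit.all)^2 ~ ., data = select(data2, -MPG_City))
# Step 2: predicted variance at the new design point (truncated at 0)
sigma2.new <- max(predict(var.fit, newcar), 0)
# Step 3: approximate 95% prediction interval using the local variance
predict(fit.all, newcar) + c(-2, 2) * sqrt(sigma2.new)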

7.4 Appendix 4: Nonlinearity and transformations

The \(x\) variable may relate to \(y\) through a function of \(x\), say \(log(x)\) or \(x^{2}\).

  • We can handle this by transforming \(x\).
  • First, let’s find out whether there is some non-linear relationship by looking at pairwise scatter plots

Based on the pairwise scatter plots below, Horsepower may need a transformation:

pairs(data1[, -1]) 

We compare MPG_City vs. Horsepower with MPG_City vs. \(\frac{1}{Horsepower}\):

par(mfrow=c(1,2))
plot(data1$Horsepower, data1$MPG_City, pch=16) 
plot(1/data1$Horsepower, data1$MPG_City, pch=16)  #looks better!

\(\frac{1}{Horsepower}\) looks more linear, so we build a model using that transformation.

fit.transform <- lm(MPG_City~ I(1/Horsepower), data1)  # Use I()
plot(fit.transform, 1)

We also try a quadratic function of Horsepower:

fit12 <- lm(MPG_City ~ Horsepower + I(Horsepower^2), data1) # fit a quadratic function
summary(fit12)
plot(fit12$fitted, fit12$residuals, pch=16)
abline(h=0, lwd=4, col="red")

plot(fit12, 2)

7.5 Appendix 5: Model Diagnoses via a perfect linear model

We simulate 100 points from the model

\[y = 1 + 2x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 2^2), \quad i.e. \quad \beta_0 = 1,\ \beta_1 = 2,\ \sigma = 2.\]

par(mfrow=c(1,1))

x <- runif(100)
y <- 1+2*x+rnorm(100,0, 2)
fit <- lm(y~x)
fit.perfect <- summary(lm(y~x))
rsquared <- round(fit.perfect$r.squared,2)
hat_beta_0 <- round(fit.perfect$coefficients[1], 2)
hat_beta_1 <- round(fit.perfect$coefficients[2], 2)
plot(x, y, pch=16, 
     ylim=c(-8,8),
     xlab="a perfect linear model: true mean: y=1+2x in blue, LS in red",
     main=paste("R squared= ",rsquared, 
                ", LS estimates b1=",hat_beta_1, "and b0=", hat_beta_0))
abline(fit, lwd=4, col="red")
lines(x, 1+2*x, lwd=4, col="blue")

Residual plot

plot(fit$fitted, fit$residuals, pch=16,
     ylim=c(-8, 8),
     main="residual plot")
abline(h=0, lwd=4, col="red")

Check normality

qqnorm(fit$residuals, ylim=c(-8, 8))
qqline(fit$residuals, lwd=4, col="blue")

The following example tells us we can’t look at y directly to check the normality assumption. Why not? We really need to examine residuals!

par(mfrow=c(2,2))
x <- runif(1000)
y <- 1+20*x+rnorm(1000,0, 2)
plot(x, y, pch=16)
hist(y, breaks=40)
qqnorm(y, main="qqnorm for y")
qqline(y)
fit <- lm(y~x)
qqnorm(fit$residuals, main="qqnorm for residuals")
qqline(fit$residuals)

par(mfrow=c(1,1))

data.frame( mean = mean(y), std.dev = sd(y))

7.6 Appendix 6: Collinearity

When some \(x\)'s are highly correlated, we cannot separate their effects. The model can still be fine for the purpose of prediction, however.

A simulation to illustrate some consequences of collinearity: when both are in the model, each individual \(p\)-value for \(x_1\) and \(x_2\) is large, yet the null that both \(\beta\)'s are 0 is rejected, because \(x_1\) and \(x_2\) are highly correlated.

par(mfrow=c(2,1))
set.seed(1)  
x1 <- runif(100)
x2 <- 2*x1+rnorm(100, 0,.1)    # x1 and x2 are highly correlated
y <- 1+2*x1+rnorm(100,0, .7)   # model
newdata.cor <- cbind(y, x1,x2) # to see the strong correlations..
pairs(newdata.cor, pch=16)

\(x_1\) is a useful predictor

summary(lm(y~x1)) 
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0132 -0.4792 -0.0524  0.4297  1.6787 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.082      0.161    6.72  1.2e-09 ***
## x1             1.872      0.277    6.76  1.0e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.737 on 98 degrees of freedom
## Multiple R-squared:  0.318,  Adjusted R-squared:  0.311 
## F-statistic: 45.7 on 1 and 98 DF,  p-value: 9.96e-10

\(x_2\) is a useful predictor

summary(lm(y~x2))
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9509 -0.5022 -0.0132  0.4831  1.6305 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.112      0.157    7.08  2.2e-10 ***
## x2             0.909      0.134    6.77  9.4e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.737 on 98 degrees of freedom
## Multiple R-squared:  0.319,  Adjusted R-squared:  0.312 
## F-statistic: 45.9 on 1 and 98 DF,  p-value: 9.39e-10

\(x_1\) and \(x_2\) together: neither appears useful given the other

summary(lm(y~x1+x2)) 
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9818 -0.5091 -0.0376  0.4437  1.6351 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.091      0.162    6.72  1.2e-09 ***
## x1             0.863      1.636    0.53     0.60    
## x2             0.497      0.794    0.63     0.53    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.739 on 97 degrees of freedom
## Multiple R-squared:  0.321,  Adjusted R-squared:  0.307 
## F-statistic: 22.9 on 2 and 97 DF,  p-value: 7.1e-09

Putting both highly correlated variables together, we can’t separate the effect of each one, though the model is still useful!

data.frame(Correlation = cor(x1, x2))
##   Correlation
## 1       0.985

Model diagnoses show the validity of the model assumptions.

plot(lm(y~x1+x2), 1:2)

7.7 Appendix 7: F-distribution

A quick look at an \(F\) distribution. One may change the degrees of freedom (e.g., to \(df_1 = 4\) and \(df_2=221\)) to see how the distribution changes.

## 
hist(rf(10000, 6, 226-7), freq=FALSE, breaks=200)   # pretty skewed to the right

Fstat=(summary(fit3))$fstat    # The Fstat, df1, df2
pvalue=1-pf(Fstat[1], 6, 219) #or pf(Fstat[1], 6, 219, lower.tail=FALSE)

 # As long as Fstat is larger than 2.14, we reject the null at .05 level.
data.frame(F_stat = Fstat[1] , pvalue = pvalue , Cutoff = qf(.95, 6, 219))